Using External Tables from BigQuery

Google BigQuery can query data directly from Google Cloud Storage through a feature called external data sources (also known as federated tables). This is useful for querying small amounts of data that you may not want to load into a BigQuery table. It is not recommended for large queries, however, because BigQuery bills by the amount of data read to process a query. Tables in BigQuery's own store are held in columnar format, so unused columns are never read and don't add to the cost; data in Cloud Storage, by contrast, is typically a compressed CSV file, and the entire file must be read for every query. Hence, while querying data in Cloud Storage can be helpful, it should be used judiciously.
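
To see how much data a query would read (and hence roughly what it would cost) before you run it, you can ask BigQuery for a dry run. The short sketch below uses the standard google-cloud-bigquery client rather than the Datalab wrapper used in the rest of this notebook, and the table name in the query is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder query; substitute a table of your own.
sql = 'SELECT name FROM `my_project.my_dataset.my_table`'

# dry_run=True plans the query without executing it, so nothing is read or billed.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print('This query would process %d bytes' % job.total_bytes_processed)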

In this notebook we will show you how to download data from a source on the Internet, put it in Cloud Storage, and then query it directly.

Getting the Data and Loading It into GCS

For this sample we will take an external CSV file, load it into Cloud Storage, and query it. We will use the Seattle bike station data from the Pronto 2015 Data Challenge dataset.


In [1]:
from google.datalab import Context
import google.datalab.bigquery as bq
import google.datalab.storage as gs

In [1]:
# urlopen lives in urllib2 on Python 2 and urllib.request on Python 3
try:
  from urllib2 import urlopen
except ImportError:
  from urllib.request import urlopen
data_source = "https://storage.googleapis.com/cloud-datalab-samples/udfsample/2015_station_data.csv"

# Download the CSV file into memory
f = urlopen(data_source)
data = f.read()
f.close()

print('Read %d bytes' % len(data))


Read 4230 bytes

In [3]:
# Get a bucket in the current project
project = Context.default().project_id
sample_bucket_name = project + '-station_data'

# Create the bucket and write the data to a GCS object
sample_bucket = gs.Bucket(sample_bucket_name)
sample_bucket.create()
sample_object = sample_bucket.object('station_data.csv')
sample_object.write_stream(data, 'text/plain')

Creating an External Data Source Object

Now we need to create a special ExternalDataSource object that refers to the data, which can, in turn, be used as a table in our BigQuery queries. We also need to provide a schema so that BigQuery knows how to interpret the data. The CSV file has a header row that we want to skip; we will use a CSVOptions object to do this.


In [23]:
options = bq.CSVOptions(skip_leading_rows=1) # Skip the header row

schema = bq.Schema([
  {'name': 'id', 'type': 'INTEGER'},         # row ID
  {'name': 'name', 'type': 'STRING'},        # friendly name
  {'name': 'terminal', 'type': 'STRING'},    # terminal ID
  {'name': 'lat', 'type': 'FLOAT'},          # latitude
  {'name': 'long', 'type': 'FLOAT'},         # longitude
  {'name': 'dockcount', 'type': 'INTEGER'},  # bike capacity
  {'name': 'online', 'type': 'STRING'}       # date station opened
])

drivedata = bq.ExternalDataSource(source=sample_object.uri, # The gs:// URL of the file 
                                  csv_options=options,
                                  schema=schema,
                                  max_bad_records=10)

In [24]:
drivedata


Out[24]:
BigQuery External Datasource - paths: gs://yelsayed-project1-station_data/station_data.csv

Querying the Table

Now let's verify that we can access the data. We will run a simple query to show the first five rows. Note that we refer to the external data source by a name of our own choosing in the query, and then map that name to the ExternalDataSource object via the data_sources dictionary parameter.


In [21]:
bq.Query('SELECT * FROM drivedatasource LIMIT 5', data_sources={'drivedatasource': drivedata}).execute().result()


Out[21]:
id  name                    terminal  lat        long         dockcount  online
4   2nd Ave & Blanchard St  BT-05     47.61311   -122.344208  14         10/13/2014
3   6th Ave & Blanchard St  BT-04     47.616094  -122.341102  16         10/13/2014
2   2nd Ave & Vine St       BT-03     47.615829  -122.348564  16         10/13/2014
1   3rd Ave & Broad St      BT-01     47.618418  -122.350964  18         10/13/2014
5   2nd Ave & Pine St       CBD-13    47.610185  -122.339641  18         10/13/2014

(rows: 5, time: 2.4s, 4KB processed, job: job_K_B1UXklE05N5hkcN2SINo33tBg)
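
As an aside, if you are not running inside Datalab, the same federated query can be expressed with the standard google-cloud-bigquery client: an ExternalConfig plays the role of the ExternalDataSource, and the table_definitions job option plays the role of the data_sources parameter. The sketch below assumes the same schema and gs:// object as above; the bucket URI is a placeholder.

from google.cloud import bigquery

client = bigquery.Client()

# Describe the CSV file in Cloud Storage (placeholder URI).
external_config = bigquery.ExternalConfig('CSV')
external_config.source_uris = ['gs://my-bucket/station_data.csv']
external_config.schema = [
  bigquery.SchemaField('id', 'INTEGER'),
  bigquery.SchemaField('name', 'STRING'),
  bigquery.SchemaField('terminal', 'STRING'),
  bigquery.SchemaField('lat', 'FLOAT'),
  bigquery.SchemaField('long', 'FLOAT'),
  bigquery.SchemaField('dockcount', 'INTEGER'),
  bigquery.SchemaField('online', 'STRING'),
]
external_config.options.skip_leading_rows = 1  # skip the CSV header row
external_config.max_bad_records = 10

# Map the name used in the query to the external configuration.
job_config = bigquery.QueryJobConfig(
    table_definitions={'drivedatasource': external_config})
rows = client.query('SELECT * FROM `drivedatasource` LIMIT 5',
                    job_config=job_config).result()
for row in rows:
  print(row)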

Finally, let's clean up.


In [25]:
sample_object.delete()
sample_bucket.delete()